Patent abstract:
Estimation method for estimating a fetal fraction, which method comprises: measuring allele presence for a predetermined number of genetic markers in a sample of cell-free DNA from a pregnant woman, each allele presence representing the presence on a genetic marker of the predetermined number of genetic markers of at least one of: a reference allele of maternal or fetal origin, and an alternative allele of maternal or fetal origin; based on the measured allele presence, calculating a corresponding number of allele frequencies for the predetermined number of genetic markers; based on the calculated number of allele frequencies, detecting a subset of the predetermined number of genetic markers, which subset is associated with heterozygous allele pairs of maternal origin; and estimating the fetal fraction based on the detected subset, wherein the fetal fraction represents the fraction of cell-free DNA of fetal origin in the sample.
Publication number: BE1023274B1
Application number: E2015/5460
Filing date: 2015-07-17
Publication date: 2017-01-19
Inventors: Paul Vauterin; Michaël Vyverman; Schrijver Joachim De
Applicant: Multiplicom Nv;
Main IPC class:
Patent description:

Estimation method and system for estimating a fetal fraction
Field of the invention
The present invention relates to the estimation of a fetal fraction. Particular embodiments relate to an estimation method and system for estimating a fetal fraction, and to a computer program product and a digital data storage medium for estimating a fetal fraction.
Background
EP 0 994 063 discloses non-invasive prenatal diagnostic methods, for example to detect certain diseases. The prenatal diagnostic methods include estimating the fetal fraction, i.e., the fraction of cell-free DNA (deoxyribonucleic acid) in the maternal sample that actually comes from the fetus. Known estimation methods for estimating the fetal fraction include the detection of SNP (single-nucleotide polymorphism, i.e. a genetic marker comprising a single variable nucleotide) alleles (i.e. variants or alternative forms of the same gene or of the same genetic marker) on the fetal cell-free DNA that are not present in the maternal DNA. In other words, these known estimation methods investigate SNP alleles for which the maternal DNA is homozygous.
Summary
Embodiments of the invention aim to provide an approach for estimating the fetal fraction.
According to a first aspect of the invention, an estimation method is provided for estimating a fetal fraction. The method comprises measuring allele presence for a predetermined number of genetic markers in a sample of cell-free DNA from a pregnant woman, wherein each allele presence represents the presence on a genetic marker of the predetermined number of genetic markers of at least one of: a reference allele of maternal or fetal origin, and an alternative allele of maternal or fetal origin. For the sake of completeness, it is noted that this does not imply that a distinction can be made between an allele of maternal origin and an allele of fetal origin, but only that an allele of either of the two origins can be measured. The method also comprises, on the basis of the measured allele presence, calculating a corresponding number of allele frequencies for the predetermined number of genetic markers. The method also comprises, on the basis of the calculated number of allele frequencies, detecting a subset of the predetermined number of genetic markers, wherein the subset is associated with heterozygous allele pairs of maternal origin. The method also comprises estimating the fetal fraction based on the detected subset, the fetal fraction representing the fraction of cell-free DNA of fetal origin in the sample.
In the context of this specification, a "genetic marker" is a position on the genome that is known to assume various possible states across individuals in a population.
By providing this estimation method, embodiments of the invention provide an approach to estimating the fetal fraction.
Further developed embodiments may rest, inter alia, on the inventive insight that estimating the fetal fraction based on the detected subset makes it possible to verify the fetal fraction as estimated in known estimation methods, independently of investigated SNP alleles for which the maternal DNA is homozygous. This verification improves the accuracy of a fetal fraction as estimated in known estimation methods.
According to a preferred embodiment, detecting the subset on the basis of the calculated number of allele frequencies comprises selecting from the predetermined number of genetic markers, as that subset, a second number of genetic markers whose corresponding allele frequencies exceed a predetermined minimum allele frequency threshold. According to a further developed embodiment, the corresponding allele frequencies of the second number of genetic markers do not exceed a predetermined maximum allele frequency threshold. According to a further developed preferred embodiment, the corresponding allele frequencies of the second number of genetic markers exceed the predetermined minimum allele frequency threshold and do not exceed the predetermined maximum allele frequency threshold.
In this way, multiple heterozygous genetic markers can be detected in a more direct manner, without detecting alleles of fetal origin that are not present in the DNA of maternal origin.
According to a preferred embodiment, estimating comprises producing an initial estimate for the fetal fraction and optimizing the initial estimate over the detected subset.
In this way an estimate is immediately available, and better estimates can be obtained by using more computing power.
According to a further developed embodiment, optimizing comprises maximizing the probability of the initial estimate, by repeatedly performing the following set of operations while the estimate is varied. For a given value of the estimate that is varied, for each genetic marker in the detected subset, the respective probabilities are calculated that the genetic marker is associated with a homozygous reference allele pair of fetal origin, a heterozygous allele pair of fetal origin, and a homozygous alternative allele pair of fetal origin. Likewise, a probability of the given value of the estimate being varied is calculated for the detected subset, based on the calculated respective probabilities for each genetic marker in the detected subset.
In this way, a structured strategy is used to search for a better estimate for the fetal fraction, taking into account the genetic markers in the detected subset.
According to a preferred embodiment, the method comprises correcting allele-specific bias for at least one of the predetermined number of genetic markers, the allele-specific bias being due to unequal amplification of a reference allele and an alternative allele, respectively, for a genetic marker.
In this way accuracy is improved, regardless of allele-specific bias.
According to a further developed embodiment, correcting an allele-specific bias for a genetic marker comprises obtaining multiple samples of cell-free DNA from pregnant women, for which samples the genetic marker is respectively associated with heterozygous allele pairs of maternal origin. The correction also includes calculating the average allele frequency for the genetic marker over the plurality of samples. The correction also comprises, on the basis of a deviation of the calculated average allele frequency from an expected value thereof, determining a correction vector to be used for the correction.
In this way allele-specific bias is efficiently corrected.
According to a preferred embodiment, the method comprises limiting the detected subset by: determining for at least one genetic marker of the detected subset a statistical probability of being associated with a homozygous allele pair of maternal origin; and, based on the determined statistical probability or probabilities, excluding at least one genetic marker from the detected subset. In other words, this makes it possible to omit genetic markers that are probably associated with a homozygous allele pair of maternal origin.
In this way the accuracy is improved by using higher quality genetic markers.
According to a preferred embodiment, the method comprises limiting the detected subset by: obtaining a plurality of samples of cell-free DNA from pregnant women, for which samples a genetic marker is associated with heterozygous allele pairs of maternal origin; calculating the standard deviation of the allele frequency for the genetic marker over the plurality of samples; and based on the calculated standard deviation of the allele frequency, excluding at least one genetic marker from the detected subset.
In this way the accuracy is improved by using higher quality genetic markers.
According to a preferred embodiment, the method comprises limiting the detected subset by: obtaining a plurality of samples of cell-free DNA from pregnant women, for which samples the genetic marker is respectively associated with heterozygous allele pairs of maternal origin; calculating the average allele frequency for the genetic marker over the plurality of samples; and on the basis of a deviation of the calculated average allele frequency from an expected value thereof, excluding at least one genetic marker from the detected subset.
In this way the accuracy is improved by using higher quality genetic markers.
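By way of non-limiting illustration, the two cross-sample limiting steps described above (exclusion based on the standard deviation of the allele frequency, and exclusion based on the deviation of its average from an expected value) could be sketched in Python as follows. The function name, the expected value of 0.5 for maternally heterozygous markers, and both threshold values are assumptions made for this sketch and are not taken from the specification.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def select_reliable_markers(
    frequencies_by_marker: Dict[str, Sequence[float]],
    expected: float = 0.5,             # assumed expected value for a maternally heterozygous marker
    max_mean_deviation: float = 0.05,  # hypothetical threshold
    max_std: float = 0.05,             # hypothetical threshold
) -> List[str]:
    """Keep markers whose allele frequency is stable and unbiased across samples.

    Each value of `frequencies_by_marker` holds the allele frequencies measured
    for one marker in several samples where that marker is maternally heterozygous.
    """
    kept = []
    for marker, freqs in frequencies_by_marker.items():
        if len(freqs) < 2:
            continue  # a standard deviation needs at least two samples
        if abs(mean(freqs) - expected) > max_mean_deviation:
            continue  # average deviates too far from the expected value
        if stdev(freqs) > max_std:
            continue  # allele frequency varies too much between samples
        kept.append(marker)
    return kept


reference_data = {
    "marker_1": [0.49, 0.51, 0.50],  # stable and centred
    "marker_2": [0.60, 0.62, 0.61],  # stable but biased
    "marker_3": [0.30, 0.70, 0.50],  # centred but noisy
}
reliable = select_reliable_markers(reference_data)
```

In this sketch only `marker_1` is retained. Suitable thresholds would in practice have to be chosen for the assay at hand.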
According to a preferred embodiment, the method comprises limiting the detected subset by excluding at least one genetic marker from the detected subset, which at least one genetic marker has fewer measured allele presences than a predetermined threshold value for the sample used.
In this way the accuracy is improved by using higher quality genetic markers.
In a preferred embodiment, the sample is blood, plasma, urine, cerebrospinal fluid, serum, saliva, or transcervical lavage fluid.
In this way the method is applicable to a multitude of samples that can be obtained from a pregnant woman.
In a preferred embodiment, the measurement comprises processing the sample using at least one of the following: polymerase chain reaction (PCR), ligase chain reaction, nucleic acid sequence-based amplification (NASBA), and branched DNA methods; and preferably PCR.
In exemplary embodiments of the invention, measuring allele presence may include measuring SNP allele presence and/or measuring allele presence for short insertions and/or deletions.
According to a further aspect of the invention, a computer program product is provided comprising computer-executable instructions to, when the program is executed on a computer, perform at least the step of estimating the fetal fraction of embodiments of the method disclosed above. The reference to computer-executable instructions is to be interpreted as including directly executable machine code, code that must be compiled to be executed, and code that is interpreted rather than executed directly.
According to a further aspect of the invention, a digital data storage medium is provided that encodes a machine-executable program of instructions to perform at least the step of estimating the fetal fraction of embodiments of the method disclosed above.
According to a further aspect of the invention, a computer program is provided comprising computer-executable instructions to, when the program is executed on a computer, perform one or more steps of embodiments of the method disclosed above. According to a further aspect of the invention, a computer device or other hardware device is provided that is programmed to perform one or more steps of any of the embodiments of the method disclosed above. In another aspect, a data storage device is provided that encodes a program in machine-readable and machine-executable form to perform one or more steps of any of the embodiments of the method disclosed above.
According to a further aspect of the invention, an estimation system for estimating a fetal fraction is provided. The system comprises a measurement module, a calculation module, a detection module and an estimation module. The measurement module is adapted to measure allele presence for a predetermined number of genetic markers in a sample of cell-free DNA from a pregnant woman, each allele presence representing the presence on a genetic marker of the predetermined number of genetic markers of at least one of: a reference allele of maternal or fetal origin, and an alternative allele of maternal or fetal origin. The calculation module is adapted to calculate, on the basis of the measured allele presence, a corresponding number of allele frequencies for the predetermined number of genetic markers. The detection module is adapted to detect, on the basis of the calculated number of allele frequencies, a subset of the predetermined number of genetic markers, which subset is associated with heterozygous allele pairs of maternal origin. The estimation module is adapted to estimate the fetal fraction based on the detected subset, the fetal fraction representing the fraction of cell-free DNA of fetal origin in the sample.
Embodiments of the system can be embodied entirely in hardware, entirely in software, or in a combination of hardware and software. In a preferred embodiment of the system, the measurement module may comprise a hardware device adapted to obtain the sample and to perform the measurements on the sample (typically under software control). The calculation module can be provided as a piece of software that can be run on a computer processor coupled to the measurement module, or that is otherwise adapted to obtain the measured allele presences produced by the measurement module, for example via a data communication channel from the measurement module to the computer processor on which the piece of software can be executed; or, in a particular embodiment, the calculation module can be provided as a hardware module coupled to the hardware measurement module. In the preferred embodiment, the detection module can be provided as a piece of software that can be executed on a computer processor (the same computer processor as above or another computer processor), or that is otherwise adapted to obtain the calculated number of allele frequencies from the calculation module, for example via the data communication channel or via another suitable data communication channel. The estimation module can be provided as a piece of software that can be run on a computer processor (the same computer processor as above, or the same other computer processor as above, or yet another computer processor), or that is otherwise adapted to obtain the detected subset, e.g. via the data communication channel or via another suitable data communication channel. In a particular embodiment, the system comprises a hardware measurement module coupled to one or more computer processors with associated computer memory storage. The calculation module, the detection module and the estimation module are present as software modules that are installed in the associated computer memory storage, or that can be activated or otherwise requested by the system to run on the one or more computer processors.
Those skilled in the art will appreciate that the steps of any of the preferred and further developed embodiments of the method disclosed above can be performed by the respective corresponding modules of the system.
According to a preferred embodiment, the system comprises a bias correction module adapted to correct allele-specific bias for at least one of the predetermined number of genetic markers, which allele-specific bias is due to unequal amplification for a reference allele and an alternative allele for a genetic marker.
According to a preferred embodiment, the system comprises a restriction module adapted to perform the step of limiting the detected subset of any of the embodiments of the method disclosed above.
Brief description of the figures
The accompanying drawings are used to illustrate non-limiting exemplary embodiments of devices of the present invention that are currently preferred. The above-described and other advantages of the features and objects of the invention will become more apparent, and the invention will be better understood, with reference to the following detailed description when read in conjunction with the accompanying drawings, in which:
Figure 1 schematically illustrates a scatter plot showing data for a sample as used in an embodiment of a method according to the present invention;
Figure 2 schematically illustrates another scatter plot showing data for a sample as used in an embodiment of a method according to the present invention;
Figure 3 schematically illustrates an example optimization as performed in an embodiment of a method according to the present invention; and
Figure 4 schematically illustrates an exemplary scatter plot showing data from a plurality of samples as used in an embodiment of a method according to the present invention.
Detailed description of exemplary embodiments
In a Non-Invasive Prenatal Test (NIPT), known in the art, cell-free DNA (cfDNA) in a maternal serum or plasma sample from a pregnant woman is sequenced to detect the presence of chromosomal aneuploidies in the fetus, such as trisomy 21. One factor to estimate when interpreting the data is the fetal fraction (FF), which is the fraction of the cell-free DNA in the maternal sample that originates from the fetus. Typical fetal fractions are between 2% and 6%, or even between 0.39% and 11.4%. According to exemplary embodiments of the invention, a method is provided for estimating the fetal fraction using data from the applicant's Clarigo test, which method does not include the detection of SNP alleles on the fetal DNA that are not present in the DNA of the pregnant woman.
The Clarigo test consists of targeted sequencing of a number of regions on the human genome (in other words, targeting specific genetic markers), using known SNPs (single-nucleotide polymorphisms) with large (e.g., greater than 1%, preferably greater than 10%) population prevalence and two possible alleles (i.e., a reference allele, also known as REF, and an alternative allele, also known as ALT). More details about the Clarigo test can be found on the internet at http://www.multiplicom.com/product/clarigo, and in WO 2013/057568 in the name of the Applicant.
Prior art document EP 0 994 063 discloses non-invasive prenatal diagnostic methods, for example to detect certain diseases. The prenatal diagnostic methods include estimating the fetal fraction. Known estimation methods for estimating the fetal fraction include the detection of SNP alleles on the fetal cell-free DNA that are not present in the maternal DNA. In other words, these known estimation methods investigate SNP alleles for which the maternal DNA is homozygous.
Embodiments of the present invention aim to provide an approach to estimating the fetal fraction.
Basing the estimate of the fetal fraction solely on the investigation of these alleles can limit reliability. Particular embodiments of the present invention may therefore be intended to permit verification of the findings of known estimation methods. These particular embodiments are based inter alia on the inventive insight that estimating the fetal fraction based on the detected subset of the number of genetic markers makes it possible to verify the fetal fraction as estimated in known estimation methods, independently of the investigated SNP alleles for which the maternal DNA is homozygous. This verification improves the reliability of the fetal fraction as estimated in known estimation methods, by measuring the accuracy of those estimates.
Moreover, it may be particularly advantageous to combine the known estimation methods with embodiments of the present invention to further improve the reliability of the estimated fetal fraction. By combining estimation methods that include the detection of SNP alleles on the fetal cell-free DNA that are not present in the maternal DNA with embodiments of the present invention, the reliability of the estimated fetal fraction can be further improved, since estimates may then be based on SNP alleles for which the maternal DNA is homozygous as well as on SNP alleles (or short insertions and deletions) for which the maternal DNA is heterozygous.
In a typical embodiment, the maternal serum or plasma sample is derived from maternal blood. This can be a small amount of serum or plasma, for example 1-20 ml. Depending on the desired accuracy, it may be preferable to use larger volumes. The preparation of the serum or plasma from the maternal blood sample can be performed using standard techniques. Suitable techniques include: centrifugation, matrix-based techniques, etc. In possible embodiments, a sequence-based enrichment method can be applied to the maternal serum or plasma to enrich it specifically for fetal nucleic acid sequences.
Embodiments of the method of the invention may include determining whether or not a sample contains fetal DNA at a fetal fraction greater than a predetermined threshold value.
In preferred embodiments, an amplification of the fetal DNA sequences in the sample is performed. Any amplification method known to those skilled in the art can be used, such as a PCR (polymerase chain reaction) method.
A first embodiment of a method for estimating a fetal fraction will now be discussed in detail. Allele presences are measured for a predetermined number of genetic markers in a sample of cell-free DNA from a pregnant woman. Each allele presence represents the presence on a genetic marker of the predetermined number of genetic markers of at least one of: a reference allele of maternal or fetal origin, and an alternative allele of maternal or fetal origin.
In other words, for a number of DNA sequences with a known location on a chromosome, it is determined whether or not, and how often, either a reference allele, an alternative allele, or both are present. The alleles can be of maternal or fetal origin, respectively, since the measurements are performed on a sample of cell-free DNA from a pregnant woman, which therefore contains DNA from both the woman and the fetus.
An advantageous way to present the results of measuring allele presence for a genetic marker is to associate the following information with a variant data point for that genetic marker. A variant data point (which is a data point associated with a number of variants, such as alleles) is used in this specification as a convenient representation of a genetic marker, and thus represents the result of measuring allele presence on a number of amplicons for that genetic marker. An amplicon is a piece of DNA or RNA that is the source and/or the product of amplification or replication events. In other words, an amplicon is a biophysical piece of replication material designed to contain a known SNP position with a high population prevalence (e.g. greater than 1%, preferably greater than 10%). Each variant data point is thus associated with a known SNP with high population prevalence (e.g. greater than 1%, preferably greater than 10%) and with two possible alleles (i.e., a reference allele, also known as REF, and an alternative allele, also known as ALT). For each variant data point, the following numbers can be determined using, for example, a standard bioinformatics pipeline applied to the sequencing data:
• The number of readings that contain the REF allele at the known SNP position,
• The number of readings that contain the ALT allele at the known SNP position,
• The total coverage,
• The allele frequency, i.e. the fraction of ALT allele readings in the total coverage.
Therefore, for a given genetic marker i, the allele presence can be measured for the REF allele, for the ALT allele, and for both alleles, by measuring the numbers of readings that contain the REF allele, the ALT allele, and either of the two alleles, respectively. Based on the measured allele presence, a corresponding number of allele frequencies is calculated for the predetermined number of genetic markers.
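As a minimal sketch of the quantities associated with a variant data point, the following Python fragment derives the total coverage and the allele frequency from the REF and ALT read counts. The class and field names are hypothetical, and the sketch assumes that the total coverage is the sum of REF and ALT readings (readings showing any other base are ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantDataPoint:
    """Read counts for one genetic marker (names are illustrative only)."""

    marker: str
    ref_reads: int  # readings containing the REF allele at the SNP position
    alt_reads: int  # readings containing the ALT allele at the SNP position

    @property
    def coverage(self) -> int:
        # Assumption: total coverage is REF readings plus ALT readings.
        return self.ref_reads + self.alt_reads

    @property
    def allele_frequency(self) -> float:
        # Fraction of ALT allele readings in the total coverage.
        if self.coverage == 0:
            raise ValueError(f"no readings for marker {self.marker}")
        return self.alt_reads / self.coverage


point = VariantDataPoint(marker="example_marker", ref_reads=1323, alt_reads=1377)
```

With these illustrative counts, `point.coverage` is 2700 and `point.allele_frequency` is 0.51.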
For each position in the genome (i.e. for each genetic locus), excluding the sex chromosomes and assuming that there are no relevant chromosomal abnormalities (i.e. that the position is not part of an aneuploid region), four copies are present in the sample, which determine the total number of readings: two copies of the maternal DNA and two copies of the fetal DNA.
For an individual variant data point (i.e. for an individual genetic marker), let A and B denote the REF and ALT alleles for the known SNP on the maternal DNA for that genetic marker, and a and b the corresponding alleles on the fetal DNA. The variant data point can then be in one of the possible states listed in Table 1:
Table 1
As an illustration, the scatter plot in Figure 1 shows data for a sample, where:
• Each point is a variant data point (and represents the result of measurements made on amplicons for a given genetic marker).
• The horizontal axis shows the fraction of readings with the ALT SNP allele (i.e. the allele frequency Fi).
• The vertical axis shows the total read coverage CTi.
It can be seen from Figure 1 that specific variant data points are associated with a specific allele frequency.
For example, variant data point 111 is shown on the left, and it represents a genetic marker for which approximately 2800 readings were made. All or nearly all readings for this genetic marker indicate that REF alleles (i.e., A) are present, and that ALT alleles (i.e., B) are not. Consequently, variant data point 111 is plotted on the left in Figure 1, where the allele frequency is (almost) 0, and probably represents a homozygous genetic marker AAaa.
Variant data point 112 is shown on the right, and represents a genetic marker for which approximately 2700 readings were made. All or nearly all readings for this genetic marker indicate that REF alleles (i.e., A) are not present, but that ALT alleles (i.e., B) are. Consequently, variant data point 112 is plotted on the right-hand side in Figure 1, where the allele frequency is (almost) 1, and probably represents a homozygous genetic marker BBbb.
Variant data point 121 represents a genetic marker for which approximately 3000 readings were made. The measured allele presence makes it possible to calculate the allele frequency, which is relatively low, but not 0. Consequently, variant data point 121 probably represents a genetic marker that is maternally homozygous for the reference allele pair but has a heterozygous allele pair of fetal origin (i.e. AAab), since the fraction of maternal DNA present in the sample for the genetic marker is (much) larger than the fraction of fetal DNA present in the sample for the genetic marker.
Variant data point 122 is shown on the right, and represents a genetic marker for which approximately 1800 readings were made. By the same reasoning, variant data point 122 probably represents a genetic marker that is maternally homozygous for the alternative allele pair but has a heterozygous allele pair of fetal origin (i.e. BBab).
Variant data point 131 represents a genetic marker for which approximately 2700 readings were made. The measured allele presence makes it possible to calculate the allele frequency, which is found to be approximately 0.51. Consequently, variant data point 131 is likely to represent a genetic marker that is maternally heterozygous (i.e. AB). It is not prima facie evident that the variant data point represents a genetic marker that is also heterozygous for the fetus, although this is likely given the large total read coverage.
Variant data point 132 represents a genetic marker for which approximately 800 readings were made. The measured allele presence makes it possible to calculate the allele frequency, which is found to be approximately 0.68. Consequently, variant data point 132 is likely to represent a genetic marker that is maternally heterozygous (i.e. AB). Since variant data point 131 represents a greater number of readings for a specific genetic marker than variant data point 132, variant data point 131 is statistically more reliable than variant data point 132.
Consequently, three groups of variant data points (11A and 11B, 12A and 12B, and 13) can be distinguished:
• Variant data points 11A and 11B that are homozygous in the maternal and fetal DNA (AAaa, BBbb).
• Variant data points 12A and 12B that are homozygous in the maternal DNA, and heterozygous in the fetal DNA (AAab, BBab). Note that in these cases, the fetal DNA contains an allele inherited from the father and not present in the maternal DNA.
• Variant data points 13 that are heterozygous in the maternal DNA (ABaa, ABab, ABbb). Note that for each of these variant data points, the fetal DNA contains only alleles that are also present in the maternal DNA.
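To illustrate why these groups occupy different positions along the horizontal axis of Figure 1, the following Python sketch computes the allele frequency that would be expected for each state at a given fetal fraction. It assumes a simple mixing model in which a fraction f of the readings originates from the two fetal copies and a fraction 1 - f from the two maternal copies, with no amplification bias; the function name and the example fetal fraction of 10% are hypothetical.

```python
def expected_alt_fraction(state: str, fetal_fraction: float) -> float:
    """Expected ALT allele frequency for a state such as 'ABaa'.

    The first two letters (A/B) give the maternal alleles, the last two
    (a/b) the fetal alleles. Assumes an unbiased mixture of maternal and
    fetal cell-free DNA.
    """
    maternal, fetal = state[:2], state[2:]
    maternal_alt = maternal.count("B") / 2  # ALT share among maternal copies
    fetal_alt = fetal.count("b") / 2        # ALT share among fetal copies
    return (1 - fetal_fraction) * maternal_alt + fetal_fraction * fetal_alt


STATES = ["AAaa", "AAab", "ABaa", "ABab", "ABbb", "BBab", "BBbb"]
expected = {state: expected_alt_fraction(state, 0.10) for state in STATES}
```

Under these assumptions and a fetal fraction of 10%, the maternally heterozygous states ABaa, ABab and ABbb are expected near 0.45, 0.50 and 0.55 respectively, whereas AAab and BBab are expected near 0.05 and 0.95.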
It is noted that several variant data points may have the same (or almost the same) allele frequency, especially if they are part of the same group. This means that (almost) the same proportion of ALT allele readings was measured for them, relative to the total number of readings.
It is also noted that, in Figure 1, variant data points with greater total read coverage (closer to the top of the graph) have a more accurate allele frequency simply because there is more measurement data. This explains why the groups 11A-B, 12A-B and 13 of variant data points in Figure 1 have a generally tapered shape. This property can be taken into account when determining a statistical reliability for a given variant data point.
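The tapering with increasing coverage can be illustrated with a small Python sketch. It assumes, purely for illustration, that each reading independently shows the ALT allele with a fixed probability (binomial sampling), which is not a model stated in this specification; the function name is hypothetical.

```python
import math


def binomial_standard_error(allele_frequency: float, coverage: int) -> float:
    """Standard error of an allele frequency measured from `coverage` readings.

    Assumption: readings are independent draws, so the ALT count is binomial.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return math.sqrt(allele_frequency * (1 - allele_frequency) / coverage)


# Coverages similar to those of variant data points 131 and 132.
se_high_coverage = binomial_standard_error(0.5, 2700)
se_low_coverage = binomial_standard_error(0.5, 800)
```

Under this assumption the spread of the measured allele frequency shrinks with the square root of the coverage, so a data point with about 2700 readings is expected to scatter less than one with about 800 readings.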
The fetal fraction can be estimated as follows.
Based on the calculated number of allele frequencies, a subset of the predetermined number of genetic markers is detected. The subset is associated with heterozygous allele pairs of maternal origin. Accordingly, the detected subset preferably includes genetic markers whose variant data points are in the states ABaa, ABab, and ABbb, and preferably does not include genetic markers whose variant data points are in the states AAaa and BBbb, or AAab and BBab. For the sake of clarity, when a variant data point is said to be in a state, it is meant that amplicons for its genetic marker are read as being in that state. In other words, measurements for that genetic marker have shown that it is in that state (and therefore that the genetic marker is, for example, heterozygous in the maternal DNA, which can be represented in the present specification by a variant data point in one of the states ABaa, ABab, or ABbb).
Note that this detection does not necessarily include the detection of (SNP) alleles on the fetal cell-free DNA that are not present in the maternal DNA. In particular, the (SNP) alleles detected on the fetal cell-free DNA are also present in the maternal DNA, since the maternal DNA is heterozygous for the detected subset: whether the alleles of fetal origin are aa, ab or bb, each of those alleles is also present in the pregnant woman, who has AB.
In a specific embodiment, this subset is detected, based on the calculated number of allele frequencies, by selecting from the predetermined number of genetic markers, as that subset, a second number of genetic markers whose corresponding allele frequencies exceed a predetermined minimum allele frequency threshold and do not exceed a predetermined maximum allele frequency threshold. Example thresholds can be 0.3 and 0.7, or 0.4 and 0.6, respectively. Example minimum and maximum thresholds can be predetermined on the basis of one or more known calibration samples.
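The threshold-based detection described above can be sketched in Python as follows. The function name, the marker labels and the example allele frequencies are hypothetical; the default thresholds of 0.3 and 0.7 are the first pair of example values mentioned above.

```python
from typing import Dict, List


def detect_maternal_heterozygous(
    allele_frequencies: Dict[str, float],
    min_threshold: float = 0.3,
    max_threshold: float = 0.7,
) -> List[str]:
    """Return the markers whose allele frequency exceeds the minimum
    threshold and does not exceed the maximum threshold."""
    return [
        marker
        for marker, frequency in allele_frequencies.items()
        if min_threshold < frequency <= max_threshold
    ]


frequencies = {
    "m111": 0.00,  # likely AAaa
    "m121": 0.04,  # likely AAab
    "m131": 0.51,  # likely maternally heterozygous
    "m132": 0.68,  # likely maternally heterozygous
    "m112": 1.00,  # likely BBbb
}
subset = detect_maternal_heterozygous(frequencies)
narrow_subset = detect_maternal_heterozygous(frequencies, 0.4, 0.6)
```

With the wider thresholds both `m131` and `m132` are selected, whereas the narrower pair 0.4 and 0.6 retains only `m131`.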
Based on the detected subset, the fetal fraction (i.e., the fraction of cell-free DNA of fetal origin in the sample) is estimated. This can be understood from the indication above that at least some of the genetic markers of the detected subset (e.g. those in the ABaa and ABbb states) are probabilistically related to the fetal fraction. Consequently, the fetal fraction can be estimated based on this detected subset. Genetic markers of the detected subset (represented by variant data points, which represent readings made on amplicons) are heterozygous in the maternal DNA and can therefore be in any of the states ABaa, ABab, or ABbb. However, the actual state is not known, and it is impossible to determine it for an individual variant data point since the distributions of the ABaa, ABab, and ABbb states overlap for typical fetal fraction values.
In a specific embodiment, the estimate may include producing an initial estimate for the fetal fraction and optimizing the initial estimate over the detected subset.
In a further developed embodiment, optimizing may include maximizing the probability of the initial estimate, i.e., finding an estimate for the fetal fraction, starting from the initial estimate, which most likely characterizes the actual fetal fraction. This can be done by repeatedly performing the following set of operations while the estimate is varied (i.e., while taking a value for the estimate that differs from its initial value): for a given value of the estimate being varied, calculating, for each genetic marker in the detected subset, the respective probabilities that that genetic marker is associated with a homozygous reference allele pair of fetal origin, a heterozygous allele pair of fetal origin, or a homozygous alternative allele pair of fetal origin; and calculating a probability of the given value of the estimate being varied for the detected subset, based on the calculated respective probabilities for each genetic marker in the detected subset.
In an exemplary embodiment, to determine the probability of a given estimate for the fetal fraction, the respective probabilities are calculated that the observed REF and ALT read counts correspond to a genetic marker in a particular one of the states ABaa, ABab, and ABbb, given that estimate for the fetal fraction. This calculation can be performed taking into account a statistical distribution assumption for the read counts, for example the assumption that the read counts follow a Poisson distribution.
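One way to write such a per-state probability down is given below. It is a sketch under assumptions that the text does not spell out: A and a denote the REF allele, B and b the ALT allele, the REF and ALT read counts are treated as independent Poisson variables, and the symbols $N$, $n_{\mathrm{REF}}$, $n_{\mathrm{ALT}}$, $f$ and $p_s$ are notation introduced here.

```latex
% f: fetal fraction, N = n_REF + n_ALT: total coverage of the marker,
% p_s(f): expected share of ALT reads when the marker is in state s.
\begin{aligned}
p_{\mathrm{ABaa}}(f) &= \tfrac{1}{2}(1 - f), \qquad
p_{\mathrm{ABab}}(f) = \tfrac{1}{2}, \qquad
p_{\mathrm{ABbb}}(f) = \tfrac{1}{2}(1 + f), \\
P(n_{\mathrm{REF}}, n_{\mathrm{ALT}} \mid s, f)
 &= \mathrm{Pois}\bigl(n_{\mathrm{REF}};\, N\,(1 - p_s(f))\bigr)\,
    \mathrm{Pois}\bigl(n_{\mathrm{ALT}};\, N\,p_s(f)\bigr), \\
\mathrm{Pois}(k;\lambda) &= \frac{\lambda^{k} e^{-\lambda}}{k!}.
\end{aligned}
```

The maternal DNA contributes a share $(1-f)$ of the reads, split evenly between both alleles, while the fetal share $f$ goes entirely to REF (state ABaa), evenly to both (ABab), or entirely to ALT (ABbb).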
In the exemplary embodiment, a general probability can optionally be calculated from the measured allele presence, i.e. from the observed REF and ALT readings for each variant data point with the given estimate for the fetal fraction by taking the greatest probability of the respective probabilities calculated for the possible states of ABaa, ABab and ABbb.
In the exemplary embodiment, the probability of the given estimate is calculated based on the calculated respective probabilities for each genetic marker in the detected subset - in particular for this embodiment, by taking the product of all general probabilities over the detected subset. This probability can then be maximized, for example using a non-linear optimization method.
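The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the embodiment itself: the expected ALT shares per state, the independence of the REF and ALT Poisson counts, all function names, and the plain grid search (standing in for the non-linear optimization method mentioned in the text) are choices made here.

```python
import math
from typing import Dict, Sequence, Tuple

STATES = ("ABaa", "ABab", "ABbb")


def alt_share(state: str, ff: float) -> float:
    """Expected ALT read share for a maternally heterozygous marker (assumed model)."""
    if state == "ABaa":
        return (1.0 - ff) / 2.0
    if state == "ABab":
        return 0.5
    if state == "ABbb":
        return (1.0 + ff) / 2.0
    raise ValueError(f"unknown state: {state}")


def poisson_logpmf(k: int, lam: float) -> float:
    """Natural log of the Poisson probability of observing k given mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def state_log_probs(ref: int, alt: int, ff: float) -> Dict[str, float]:
    """Log-probability of the observed read counts under each fetal state."""
    total = ref + alt
    result = {}
    for state in STATES:
        p = alt_share(state, ff)
        result[state] = (poisson_logpmf(ref, total * (1.0 - p))
                         + poisson_logpmf(alt, total * p))
    return result


def aggregate_log_prob(markers: Sequence[Tuple[int, int]], ff: float) -> float:
    """Sum over markers of the log of the greatest per-state probability."""
    return sum(max(state_log_probs(r, a, ff).values()) for r, a in markers)


def estimate_ff(markers: Sequence[Tuple[int, int]],
                lo: float = 0.001, hi: float = 0.5, steps: int = 500) -> float:
    """Grid search for the fetal fraction with the highest aggregate log-probability."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda ff: aggregate_log_prob(markers, ff))


# Synthetic (REF, ALT) counts constructed to match a fetal fraction of 10%.
markers = [(550, 450), (450, 550), (500, 500), (1100, 900)]
print(round(estimate_ff(markers), 3))
```

Summing logarithms replaces the product of the general probabilities and avoids numerical underflow when the detected subset contains many markers.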
For some variant data points in the test, the amplification performed during sample processing to measure the allele presences does not amplify the REF and ALT alleles in equal proportions. The result is an allele-specific bias.
This bias can be corrected for a given genetic marker by analyzing a batch of several samples, preferably simultaneously, and measuring the average allele frequency of the variant data points for the given genetic marker over all samples in the batch for which those variant data points are heterozygous in the maternal DNA. If this average is based on a sufficiently large number of samples, its deviation from the expected value 0.5 can be used to establish a correction factor for each variant data point.
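The text does not specify how the correction factor is derived from the deviation. One plausible choice, shown here purely as an assumption, is to interpret the batch average as a relative amplification efficiency of ALT versus REF and to rescale the ALT read count accordingly. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def alt_amplification_ratio(het_frequencies: Sequence[float]) -> float:
    """Relative ALT/REF amplification efficiency for one marker.

    het_frequencies: ALT allele frequencies of this marker in those batch
    samples where the maternal DNA is heterozygous. An unbiased marker
    averages 0.5 and yields a ratio of 1.0.
    """
    avg = mean(het_frequencies)
    if not 0.0 < avg < 1.0:
        raise ValueError("average frequency must lie strictly between 0 and 1")
    return avg / (1.0 - avg)


def corrected_frequency(ref_reads: int, alt_reads: int, ratio: float) -> float:
    """ALT allele frequency after dividing the ALT count by the bias ratio."""
    alt_adjusted = alt_reads / ratio
    return alt_adjusted / (ref_reads + alt_adjusted)


ratio = alt_amplification_ratio([0.55, 0.56, 0.54])
print(corrected_frequency(450, 550, ratio))
```

A marker that averages 0.55 over the batch is pulled back to 0.5 for a sample that shows exactly the batch-average bias.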
The accuracy of the estimation of the fetal fraction can be improved by limiting the selected variant data points to a high-quality subset. This limitation can be based on an individual target sample, or on a batch of samples. Examples of filters for such a limitation include one or more of the following approaches:
• reject variant data points whose total coverage is below a predetermined threshold for the target sample; and
• for each variant data point, calculate the standard deviation of the allele frequencies across all samples in a batch, using only those samples where the variant data point is heterozygous in the maternal DNA, and reject the variant data point if this standard deviation is above a predetermined threshold.
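Both filters can be sketched as simple predicates. The threshold values and function names below are placeholders chosen for illustration; the text only requires that the thresholds be predetermined.

```python
from statistics import stdev
from typing import Sequence


def passes_coverage(ref_reads: int, alt_reads: int, min_coverage: int) -> bool:
    """True if the total coverage in the target sample reaches the threshold."""
    return ref_reads + alt_reads >= min_coverage


def passes_stability(het_frequencies: Sequence[float], max_sd: float) -> bool:
    """True if the allele frequency is stable across the batch.

    het_frequencies holds the allele frequencies of one variant data point,
    restricted to samples where it is heterozygous in the maternal DNA.
    Fewer than two samples give no spread estimate, so the point is rejected.
    """
    if len(het_frequencies) < 2:
        return False
    return stdev(het_frequencies) <= max_sd


print(passes_coverage(911, 786, min_coverage=1000))
print(passes_stability([0.49, 0.50, 0.51], max_sd=0.05))
```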
Figure 2 schematically illustrates another dot plot showing data for a sample as used in an embodiment of a method according to the present invention. The dot plot shows the allele frequency of each variant data point against its total coverage. In other words, for each genetic marker considered, the vertical axis shows how many allele presences were measured in total (REF and ALT combined) for that genetic marker, and the horizontal axis shows how frequently the presence of the ALT allele was measured for that genetic marker.
A total of 3638 variant data points are shown in Figure 2. As for Figure 1, three groups of variant data points (21A-B, 22A-B and 23) can be distinguished:
• 1947 variant data points 21A and 21B are purely homozygous and represent variant data point states AAaa (namely 21A) and BBbb (namely 21B).
• 647 variant data points 22A and 22B represent variant data point states AAab (namely 22A) and BBab (namely 22B).
• 1044 variant data points 23 are heterozygous and represent all other variant data point states (namely ABaa, ABab and ABbb).
The vertical lines 24 and 25 at allele frequencies 0.3 and 0.7 indicate exemplary threshold values for separating the heterozygous variant data points (i.e. the heterozygous genetic markers) from the rest of the variant data points. The heterozygous variant data points lying between these thresholds are used in this embodiment to estimate the fetal fraction, as explained below. Those skilled in the art will understand that other minimum and maximum threshold values could have been chosen, for example 0.2 and 0.8, or 0.4 and 0.6, or other values such as 0.301 and 0.699, or even combinations thereof, such as 0.2 and 0.7, or 0.3 and 0.6, or 0.4 and 0.8, etc.
Let one of the variant data points in this example be shown in Figure 1 with a total coverage of 1697, where 911 readings contain the REF allele at the known SNP position and 786 readings contain the ALT allele at the known SNP position. The allele frequency (786/1697 ≈ 0.46 for the ALT allele) falls between the minimum and maximum thresholds for heterozygosity, and thus this variant data point is selected as part of the subset detected in the method embodiment and used to estimate the fetal fraction.
Figure 3 schematically illustrates an example optimization as performed in an embodiment of a method according to the present invention. Figure 3 shows the estimated fetal fraction FF horizontally, and shows the aggregated (logarithmic) probability of a corresponding estimated fetal fraction FF vertically.
Let FF = 4% be an initial estimate of the fetal fraction FF. Using this estimate of the fetal fraction and assuming that the number of readings follows a Poisson distribution, probability values can be calculated that indicate how well the example variant data point corresponds to the ABaa, ABab or ABbb states. For the sake of clarity of representation, the natural logarithm of the probability values is given here. The respective logarithm values of the probabilities calculated by the method embodiment in this example are:
• state ABaa: -9.9,
• state ABab: -13.4,
• state ABbb: -19.4.
This means that state ABaa is the most likely state for the example variant data point, given the current value of the estimate for the fetal fraction FF. When the logarithmic values of the probabilities of all variant data points are aggregated, that is, for all genetic markers in the detected subset, the logarithmic probability Q of FF = 4% is -7916.08.
Then vary the estimate so that FF = 10%. The logarithmic values of the probabilities for the example variant data point are then:
• state ABaa: -9.0,
• state ABab: -13.4,
• state ABbb: -32.9.
This means that state ABaa is again the most likely state for the example variant data point, given the new value of the estimate for the fetal fraction FF. The logarithmic probability Q for FF = 10% over all genetic markers in the detected subset is -7672.53. This probability is greater than that of the previous estimate, which makes FF = 10% a more likely candidate for the actual fetal fraction.
Analogous calculations can be performed by embodiments of the method for varying estimates of the fetal fraction (e.g. 301-306) using a non-linear optimization method, which gives curve 31. The probabilities of the fetal fraction estimates for this example are shown in Figure 3. The maximum probability is reached (at line 33) for FF = 9.5% (at line 32).
Figure 4 schematically illustrates an exemplary dot plot showing data from multiple samples as used in an embodiment of a method according to the present invention, and provides an example of possible filtering of unreliable variant data points by limiting the detected subset of genetic markers. The data represent a batch of cell-free DNA samples from pregnant women, for which samples a genetic marker is associated with heterozygous allele pairs of maternal origin (i.e., ABaa, ABab, or ABbb). For each genetic marker in the example batch, the average allele frequency <F_i> over those samples in which the marker is heterozygous (x-axis) is plotted against the standard deviation SD of those allele frequencies (again taken only over those samples in which the genetic marker is heterozygous).
For an optional first constraint, the standard deviation of the allele frequency can be calculated for the genetic marker over the plurality of samples. On the basis of the calculated standard deviation of the allele frequency, at least one genetic marker can be excluded from the detected subset (preferably all genetic markers whose standard deviation exceeds a predetermined deviation threshold).
For an optional second constraint, which can be used separately or together with the first constraint, the average allele frequency can be calculated for the genetic marker over the plurality of samples. Based on a deviation of the calculated average allele frequency from an expected value thereof, at least one genetic marker can be excluded from the detected subset (preferably all genetic markers whose average lies too far from the expected value).
The lines 41, 42 and 43 indicate exemplary thresholds to use for filtering in such exclusions. Variant data points with a large standard deviation (for example those above threshold 43) are filtered out (i.e. their corresponding genetic markers are excluded from the detected subset), as are variant data points whose average heterozygous allele frequency deviates too far from 0.5 (for example those outside thresholds 41 and 42) and which thus exhibit allele-specific bias in the data. This leaves the data cluster 40 of higher quality (greater reliability).
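The combined exclusion can be sketched as a box filter in the (average, standard deviation) plane of Figure 4. The numeric thresholds below are invented placeholders, since the text does not state the values of lines 41, 42 and 43, and the function and marker names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def reliable_markers(
    batch: Dict[str, Sequence[float]],
    mean_low: float = 0.45,   # placeholder for threshold 41
    mean_high: float = 0.55,  # placeholder for threshold 42
    sd_max: float = 0.05,     # placeholder for threshold 43
) -> List[str]:
    """Return markers whose batch statistics fall inside the accepted box.

    batch maps a marker id to its allele frequencies in those samples
    where the marker is heterozygous in the maternal DNA.
    """
    kept = []
    for marker, freqs in batch.items():
        if len(freqs) < 2:
            continue  # no spread estimate possible, treat as unreliable
        if mean_low <= mean(freqs) <= mean_high and stdev(freqs) <= sd_max:
            kept.append(marker)
    return kept


batch = {
    "good": [0.49, 0.50, 0.52],
    "biased": [0.60, 0.61, 0.62],   # stable, but average far from 0.5
    "noisy": [0.35, 0.50, 0.65],    # average 0.5, but large spread
}
print(reliable_markers(batch))
```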
One skilled in the art would readily recognize that steps of various methods described above can be performed by programmed computers. Herein, some embodiments are also intended to cover program storage devices, for example digital storage media, that are machine or computer readable and encode machine-executable or computer-executable programs of instructions, wherein said instructions perform some or all of the steps of the methods described above. The program storage devices can be, for example, digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard disks, or optically readable digital data storage media. The embodiments are also intended to cover computers that are programmed to perform the steps of the methods described above.
The functions of the various elements shown in the figures, including any functional blocks labeled as "processors" or "modules", can be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In addition, explicit use of the term "processor" or "controller" should not be construed as referring exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, a network processor, an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are only conceptual: their function can be performed through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
Those skilled in the art will appreciate that the block diagrams show conceptual views of exemplary circuits in which the principles of the invention are implemented. Similarly, it will be understood that graphs, flow charts, state transition diagrams, pseudocode, and the like represent various processes that may be substantially represented in a computer-readable medium, and thus may be executed by a computer or processor, even though such a computer or processor is not explicitly shown.
Although the principles of the invention have been set forth above in connection with specific embodiments, it is to be understood that this description is made merely as an example and not as a limitation on the scope of protection defined by the appended claims.
Claims:
Claims (17)
[1]
An estimation method for estimating a fetal fraction, which method comprises: - measuring allele presences for a predetermined number of genetic markers in a sample of cell-free DNA from a pregnant woman, wherein each allele presence represents the presence, on a genetic marker of the predetermined number of genetic markers, of at least one of: a reference allele of maternal or fetal origin, and an alternative allele of maternal or fetal origin; - on the basis of the measured allele presences, calculating a corresponding number of allele frequencies for the predetermined number of genetic markers; - on the basis of the calculated number of allele frequencies, detecting a subset of the predetermined number of genetic markers, wherein the subset is associated with heterozygous allele pairs of maternal origin; and - estimating the fetal fraction based on the detected subset, wherein the fetal fraction represents the fraction of cell-free DNA of fetal origin in the sample.
[2]
Estimation method according to claim 1, wherein detecting the subset comprises: - based on the calculated number of allele frequencies, selecting from the predetermined number of genetic markers, as the subset, a second number of genetic markers whose corresponding allele frequencies exceed a predetermined minimum allele frequency threshold and do not exceed a predetermined maximum allele frequency threshold.
[3]
The estimation method according to any of the preceding claims, wherein the estimating comprises: - producing an initial estimate for the fetal fraction; and - optimizing the initial estimate over the detected subset.
[4]
Estimation method according to claim 3, wherein optimizing comprises maximizing the probability of the initial estimate, by repeatedly performing the following set of operations while the estimate is varied: - calculating, for a given value of the estimate being varied and for each genetic marker in the detected subset, the respective probabilities that the genetic marker is associated with a homozygous reference allele pair of fetal origin, a heterozygous allele pair of fetal origin, and a homozygous alternative allele pair of fetal origin; and - calculating a probability of the given value of the estimate being varied, for the detected subset, based on the calculated respective probabilities for each genetic marker in the detected subset.
[5]
Estimation method according to any one of the preceding claims, comprising correcting allele-specific bias for at least one of the predetermined number of genetic markers, which allele-specific bias is due to unequal amplification of a reference allele and an alternative allele for a genetic marker.
[6]
Estimation method according to claim 5, wherein correcting allele-specific bias for a genetic marker comprises: - obtaining a plurality of samples of cell-free DNA from pregnant women, for which samples the genetic marker is respectively associated with heterozygous allele pairs of maternal origin; - calculating the average allele frequency for the genetic marker over the plurality of samples; and - determining a correction factor to use for the correcting, based on a deviation of the calculated average allele frequency from an expected value thereof.
[7]
Estimation method according to any one of the preceding claims, comprising limiting the detected subset by: - determining, for at least one genetic marker of the detected subset, a statistical probability that it is associated with a homozygous allele pair of maternal origin; and - based on the determined statistical probability or probabilities, excluding at least one genetic marker from the detected subset.
[8]
Estimation method according to any one of the preceding claims, comprising limiting the detected subset by: - obtaining a plurality of samples of cell-free DNA from pregnant women, for which samples a genetic marker is associated with heterozygous allele pairs of maternal origin; - calculating the standard deviation of the allele frequency for the genetic marker over the plurality of samples; and - based on the calculated standard deviation of the allele frequency, excluding at least one genetic marker from the detected subset.
[9]
Estimation method according to any one of the preceding claims, comprising limiting the detected subset by: - obtaining a plurality of samples of cell-free DNA from pregnant women, for which samples the genetic marker is respectively associated with heterozygous allele pairs of maternal origin; - calculating the average allele frequency for the genetic marker over the plurality of samples; and - on the basis of a deviation of the calculated average allele frequency from an expected value thereof, excluding at least one genetic marker from the detected subset.
[10]
Estimation method according to any one of the preceding claims, comprising limiting the detected subset by: - excluding at least one genetic marker from the detected subset, which at least one genetic marker, for the samples used, has fewer measured allele presences than a predetermined allele presence threshold.
[11]
The estimation method according to any of the preceding claims, wherein the sample is blood, plasma, urine, cerebrospinal fluid, serum, saliva or transcervical flushing fluid.
[12]
The estimation method according to any of the preceding claims, wherein the obtaining comprises at least one of the following: polymerase chain reaction (PCR), ligase chain reaction, nucleic acid sequence-based amplification (NASBA), and branched DNA methods; and preferably PCR.
[13]
A computer program product comprising computer executable instructions to, when the program is run on a computer, perform at least the step of estimating the fetal fraction of the method of any one of the preceding claims.
[14]
A digital storage medium encoding a machine-executable program of instructions to at least perform the step of estimating the fetal fraction of the method of any one of claims 1-12.
[15]
Estimation system for estimating a fetal fraction, which system comprises: - a measurement module adapted to measure allele presences for a predetermined number of genetic markers in a sample of cell-free DNA from a pregnant woman, each allele presence representing the presence, on a genetic marker of the predetermined number of genetic markers, of at least one of: a reference allele of maternal or fetal origin, and an alternative allele of maternal or fetal origin; - a calculation module adapted to calculate, on the basis of the measured allele presences, a corresponding number of allele frequencies for the predetermined number of genetic markers; - a detection module adapted to detect, on the basis of the calculated number of allele frequencies, a subset of the predetermined number of genetic markers, wherein the subset is associated with heterozygous allele pairs of maternal origin; and - an estimation module adapted to estimate the fetal fraction based on the detected subset, the fetal fraction representing the fraction of cell-free DNA of fetal origin in the sample.
[16]
Estimation system according to claim 15, comprising a bias correction module adapted to correct allele-specific bias for at least one of the predetermined number of genetic markers, which allele-specific bias is due to unequal amplification of a reference allele and an alternative allele, respectively, for a genetic marker.
[17]
Estimation system according to claim 15 or 16, comprising a limiting module adapted to perform the step of limiting the detected subset according to any one of claims 7-10.
Patent family:
Publication number | Publication date
BE1023274A9|2017-03-17|
BE1023274A1|2017-01-19|
Legal status:
2021-05-07| PD| Change of ownership|Owner name: AGILENT TECHNOLOGIES, INC.; US Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), ASSIGNMENT Effective date: 20210315 |
Priority:
Application number | Filing date | Patent title
BE20155460A|BE1023274A9|2015-07-17|2015-07-17|Estimation method and system for estimating a fetal fraction|